Fix(consolidation): populate search_vector on observation INSERT/UPDATE in consolidator by qxxaa · Pull Request #2425 · vectorize-io/hindsight

qxxaa · 2026-06-26T15:57:34Z

Problem

The consolidator creates and updates observations in memory_units without populating the search_vector column. Under the native text search extension backend, this makes all observations invisible to BM25 full-text retrieval.

This is a known gap documented in the source code itself:

# Native: the migration p4q5r6s7t8u9 dropped the GENERATED expression on
# search_vector to allow per-deployment language configuration; the
# batch insert path in ops_postgresql.insert_facts_batch now populates
# it via to_tsvector($lang, ...). This single-observation INSERT does
# not, so observations under the native backend currently land with
# NULL search_vector and are not BM25-searchable until reflected/
# re-ingested. Tracking a separate fix for that gap.

This PR fills that gap.

The raw fact batch insert path (ops_postgresql.insert_facts_batch) correctly populates search_vector with to_tsvector(). The consolidation pipeline - which produces the observation layer that recall surfaces - does not. The BM25 retrieval arm consistently returns 0 candidates for observations:

Parallel retrieval: semantic=12, bm25=0, graph=95, temporal=1

After the fix:

Parallel retrieval: semantic=12, bm25=3, graph=95, temporal=1

Root cause

Four code paths in consolidator.py write observation text without populating search_vector:

Function	Operation	Trigger
`_dedup_reconcile_create`	UPDATE	New observation merges into near-twin
`_dedup_reconcile_update`	UPDATE	Updated observation drifts into different twin
`_execute_update_action`	UPDATE	LLM rewrites existing observation
`_create_observation_directly`	INSERT	New observation created

Fix

Add conditional search_vector population to all four sites, gated on config.text_search_extension == "native". This matches the existing pattern in ops_postgresql.insert_facts_batch.

For UPDATE paths (Sites 1-3), a search_vector_clause string is conditionally appended to the SET clause:

search_vector_clause = (
    f",\n            search_vector = to_tsvector('{config.text_search_extension_native_language}'::regconfig, COALESCE($1, ''))"
    if config.text_search_extension == "native"
    else ""
)

For the INSERT path (Site 4), the existing else branch (which bundled native + pg_textsearch + pgroonga + pg_search) is split into elif "native" (with tsvector) and else (without), ensuring non-native backends continue to leave search_vector NULL as they index base text columns directly.

Backend gating rationale

The fix is restricted to text_search_extension == "native" because:

vchord: Has its own INSERT branch with tokenize() - already handles search_vector
native: Uses PostgreSQL's built-in tsvector/GIN for BM25 - needs explicit to_tsvector() population
pg_textsearch / pgroonga / pg_search: Index the raw text column directly via their own extension mechanisms - search_vector is a dummy column that should remain NULL

tsvector content: observation text only

The tsvector is built from COALESCE(text, '') - the observation's own text content only. Entity names, source fact text, and temporal metadata are intentionally excluded because:

Semantic retrieval already handles meaning-based discovery (embeddings)
Graph retrieval already handles entity-based discovery (entity links)
Temporal retrieval already handles time-based discovery (occurred_start/end, mentioned_at)

BM25's role is keyword matching on the observation's surface text - the one signal the other three arms don't cover. Enriching the tsvector with entity names or source context would create "contextual leakage" where non-central details pull general observations into narrow keyword matches.

Regression risk

None. The fix only fires when text_search_extension == "native" - all other backends are unaffected. The tsvector is built from the observation's own text content using the configured language dictionary (config.text_search_extension_native_language), matching the existing pattern in insert_facts_batch. No new parameters are added to any SQL query.

Testing

Added test_create_observation_populates_search_vector_native in tests/test_consolidation.py. Asserts that observations created via consolidation have search_vector IS NOT NULL under the native text search backend (Site 4 - the INSERT path, highest frequency).

The UPDATE path (Site 3) is verified to work via live deployment testing but not unit-tested - triggering an UPDATE deterministically requires a real LLM making consolidation decisions (the mock LLM always creates fresh observations). Sites 1-2 (dedup reconcile) are similarly impractical to trigger without controlling embedding similarity thresholds.

Additionally verified on a live deployment with ~139 observations:

Pre-patch: all observations have search_vector IS NULL, BM25 returns 0
Post-patch: new observations created by consolidation have populated tsvectors
BM25 arm returns relevant candidates (3+ observations matching keyword queries)
Semantic, graph, and temporal arms unaffected

The consolidator creates and updates observations without populating the search_vector tsvector column. Under the native text search extension, this means observations are invisible to BM25 full-text retrieval - the BM25 arm returns 0 candidates regardless of query content. Four code paths write observation text to memory_units: 1. _dedup_reconcile_create (merge into existing twin) 2. _dedup_reconcile_update (drift-merge into different twin) 3. _execute_update_action (LLM rewrite of existing observation) 4. _create_observation_directly (new observation INSERT) None populated search_vector. This patch adds conditional tsvector generation gated on config.text_search_extension == 'native', matching the existing pattern in ops_postgresql.insert_facts_batch. Non-native backends (pg_textsearch, pgroonga, pg_search) continue to leave search_vector NULL as they index base text columns directly. The INSERT path (Site 4) splits the existing else branch into an explicit elif/else to avoid applying native tsvector logic to backends that don't use it. Fixes: observations invisible to BM25 retrieval arm.

Add test for observation creation with native search vector

qxxaa · 2026-06-26T16:04:06Z

Deployment note: Existing observations created before this fix still have search_vector IS NULL. A one-time backfill is needed for full BM25 coverage:

UPDATE memory_units
SET search_vector = to_tsvector('english'::regconfig, COALESCE(text, ''))
WHERE fact_type = 'observation' AND search_vector IS NULL;

Suggest including this in the release notes. Note: the regconfig language should match the deployment's HINDSIGHT_API_TEXT_SEARCH_EXTENSION_NATIVE_LANGUAGE setting. This only applies to deployments using text_search_extension = "native" and the query can be expensive on large tables, so operators should schedule it during low traffic.

For external PG instances this is a manual step; for the default baked-in pg0 deployment it could potentially be an Alembic migration, but that's a separate decision for maintainers.

…ations The writer fix only populates search_vector for observations created or updated after deploy. Observations already written under the native backend keep a NULL search_vector and stay invisible to BM25 until re-consolidated. Add migration c3f7a1b9d2e4 to backfill them, gated on the native tsvector column type and scoped to fact_type='observation' with a NULL search_vector (idempotent). Matches the writer's text-only tsvector and the configured native language.

post-checkout/post-commit/post-merge/pre-push were git-lfs stubs picked up from the contributor's local hookspath and committed by mistake. They are unrelated to this change; the project's real .githooks/pre-commit is left intact.

…nfig, trace recorder leak) (#2491) * test(consolidation): fix dedup merge-path tests missing text-search config The dedup merge/update path builds a search_vector UPDATE clause from config.text_search_extension (+ _native_language) since #2425, but the _dedup_reconcile_create / _dedup_reconcile_update test configs only set consolidation_dedup_threshold, so the two merge-path tests raised AttributeError: 'types.SimpleNamespace' object has no attribute 'text_search_extension' on main. Add the two fields (production defaults native/english) to those configs. The clause reuses $1, so the existing positional-arg assertions are unchanged. * test(fact-extraction): pass Vertex AI settings when building LLMConfig Regression: LLMConfig was refactored to use vertexai_project_id/region/ service_account_key as-passed (the caller resolves the global-config fallback), but the llm_config fixture never forwarded them. So with the CI provider set to vertexai, LLMConfig raised "HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID is required" even though the env var was set — the test errored in the test-api job. Forward the three Vertex settings from config (mirroring MemoryEngine's own LLMConfig construction). Verified: LLMConfig(provider="vertexai", ...) now constructs with project_id passed, and still raises when it is omitted. * test(llm-provider): forward provider-specific settings in _make_llm _make_llm() built an LLMProvider from the env-selected provider without forwarding provider-specific settings, so the vertexai and litellmrouter acceptance-matrix jobs failed at construction ("VERTEXAI_PROJECT_ID is required" / "litellmrouter requires a config object"). LLMProvider uses these as-passed (it does not resolve them from global config), so forward vertexai_project_id/region/service_account_key and litellmrouter_config. * test(llm-trace): drop leaked span recorders after each test (#2229) Root cause of the flaky test_llm_trace::test_disabled_writes_no_rows: MemoryEngine.__init__ registers its LLM-trace recorder in a process-global registry, and only close() removes it. Tests that construct an engine directly (test_per_operation_llm_config, test_llm_reasoning_effort_env, etc.) never close it, leaking an ENABLED recorder. Locally those recorders' writes fail (uninitialized backend), but in CI a leaked recorder with a live backend records a later test's LLM calls into the shared DB — so test_disabled_writes_ no_rows sees rows for its bank even though its own recorder is disabled (assert N == 0). Reproduced: after test_per_operation_llm_config the registry holds 8 enabled recorders. Add an autouse fixture that snapshots the registry and removes anything a test leaked. Verified: the registry drops from 8 leaked recorders back to 1. _teardown_memory_engine already guards the fixtures; this guards direct constructions. 53 trace + leak-risk tests pass together under -n2.

…TE in consolidator (vectorize-io#2425) * fix: populate search_vector on observation INSERT/UPDATE in consolidator The consolidator creates and updates observations without populating the search_vector tsvector column. Under the native text search extension, this means observations are invisible to BM25 full-text retrieval - the BM25 arm returns 0 candidates regardless of query content. Four code paths write observation text to memory_units: 1. _dedup_reconcile_create (merge into existing twin) 2. _dedup_reconcile_update (drift-merge into different twin) 3. _execute_update_action (LLM rewrite of existing observation) 4. _create_observation_directly (new observation INSERT) None populated search_vector. This patch adds conditional tsvector generation gated on config.text_search_extension == 'native', matching the existing pattern in ops_postgresql.insert_facts_batch. Non-native backends (pg_textsearch, pgroonga, pg_search) continue to leave search_vector NULL as they index base text columns directly. The INSERT path (Site 4) splits the existing else branch into an explicit elif/else to avoid applying native tsvector logic to backends that don't use it. Fixes: observations invisible to BM25 retrieval arm. * Implement test for search vector population in observations Add test for observation creation with native search vector * style: run lint * fix(consolidation): backfill search_vector for existing native observations The writer fix only populates search_vector for observations created or updated after deploy. Observations already written under the native backend keep a NULL search_vector and stay invisible to BM25 until re-consolidated. Add migration c3f7a1b9d2e4 to backfill them, gated on the native tsvector column type and scoped to fact_type='observation' with a NULL search_vector (idempotent). Matches the writer's text-only tsvector and the configured native language. * chore: remove accidentally committed git-lfs hooks post-checkout/post-commit/post-merge/pre-push were git-lfs stubs picked up from the contributor's local hookspath and committed by mistake. They are unrelated to this change; the project's real .githooks/pre-commit is left intact. --------- Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>

qxxaa added 2 commits June 26, 2026 16:24

Implement test for search vector population in observations

8a891cc

Add test for observation creation with native search vector

qxxaa and others added 3 commits June 26, 2026 16:46

style: run lint

b960307

chore: remove accidentally committed git-lfs hooks

986fd5c

post-checkout/post-commit/post-merge/pre-push were git-lfs stubs picked up from the contributor's local hookspath and committed by mistake. They are unrelated to this change; the project's real .githooks/pre-commit is left intact.

nicoloboschi merged commit 21176f8 into vectorize-io:main Jun 30, 2026
87 checks passed

nicoloboschi mentioned this pull request Jul 1, 2026

test: fix CI regressions (Vertex/litellmrouter construction, dedup config, trace recorder leak) #2491

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix(consolidation): populate search_vector on observation INSERT/UPDATE in consolidator#2425

Fix(consolidation): populate search_vector on observation INSERT/UPDATE in consolidator#2425
nicoloboschi merged 5 commits into
vectorize-io:mainfrom
qxxaa:fix(consolidation)-observation-search-vector

qxxaa commented Jun 26, 2026

Uh oh!

qxxaa commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

qxxaa commented Jun 26, 2026

Problem

Root cause

Fix

Backend gating rationale

tsvector content: observation text only

Regression risk

Testing

Uh oh!

qxxaa commented Jun 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants